VSIPL for Diverse Architectures (Pentium 4 to DSPs)
Mr. Brian Chase
Mr. Wenhao Wu
Dr. Anthony Skjellum
MPI Software Technology, Inc.
Phone: (662) 320-4300, ext. 13
Fax: (662) 320-4301
E-mail: brian@mpi-softtech.com
E-mail: wenhao@mpi-softtech.com
E-mail: tony@mpi-softtech.com


Many companies have described their experiences using Motorola G4 processors to provide
the VSIPL CoreLite standard for military and medical computing, and others have described
their efforts with highly optimized Core and Core+ profiles.

This poster uses the experience of MPI Software Technology's existing Core+ optimized VSIPL
implementation for G4 as a springboard for supporting other platforms of emerging interest to
COTS, defense, medical and imaging customers.

Three trends are emerging in military computing that may unseat the preeminence of
low-power G4-style RISC processors: 1) the growing power of DSPs, including better
development environments, as typified by the TI TMS320C6x family (as opposed to the SHARC
2106x, which was harder to program strictly from C and libraries); 2) the use of medium-power
systems with Pentium 4 based blades; and 3) the growing perception that PowerPC is lagging
Pentium in overall performance.

Furthermore, the clock speed of the Intel Pentium line of processors is now reaching 3 GHz
or more. Also, the vector processing registers (SSE) available on the Pentium III and later
provide substantially improved performance for single-precision floating-point operations,
since each 128-bit register holds four single-precision values.
Additionally, the well-known Pentium family provides a cost-effective COTS solution for
embedded hardware designers as well as end users. This means that, despite their high power
consumption, these processors are becoming more attractive in Flops/Watt.

MPI Software Technology now has a fully optimized VSIPL (Core and CoreLite) library, known
as VSI/Pro, for the Pentium 4 / SSE platform, targeting the Linux, Windows, and VxWorks
operating systems. The FFT performance of VSI/Pro is consistently better than Intel’s Math
Kernel Library (a.k.a. MKL), as illustrated in the following graph. VSI/Pro has a carefully
designed API, i.e., VSIPL, which is well suited to the needs of the signal and image
processing community in the defense and medical industries.

[Figure: VSI/Pro vs. MKL RCFFT performance on a Pentium 4, 2.0 GHz, 512 KB L2 cache]

The Texas Instruments TMS320C67 family of processors is a general-purpose DSP family whose
architecture is well suited to FFT and FIR operations. This family is widely used and
accepted by the signal processing community, with applications including software radio,
modems, and sonar. Notable features of this family include: 1) a very deep pipeline
and 2) a very long instruction word (VLIW) architecture. Texas Instruments provides an
integrated development environment known as Code Composer, which supports the C and C++
programming languages. Also, the Tiny C Compiler can produce correct code for this
DSP. The port of VSI/Pro to this processor family has started with hand tuning of the
FFT operations and mapping of the algorithms from the original RISC/CISC implementation
to the DSP.

Experiences, results, plans and future work for supporting the Intel Pentium 4 and Texas
Instruments TMS320C67 families within VSI/Pro are described in this poster.

Overview: Current VSIPL platform support
Status: G4 / Altivec

Widely used worldwide

Domestic production computing adoption picking up

Helps untie programs from specific vendors

Optimizing for G4 is a major part of our expertise

Porting to different PPC environments is also key expertise

Dealing with C/C++ toolchains is a major area of expertise

Key high-performance optimizations for more advanced users (e.g., Rader’s
algorithm and other NTT-motivated improvements) are at the cusp of the
newest release efforts

Complete version for image processing also released

Customers have started asking for non-G4/Altivec alternatives!

Availability on Different Processors    Operating System / Development Tools

Core+     G4 / Altivec                  VxWorks, MercuryOS, LynxOS, Linux, MacOSX
Core      P4 / SSE                      Windows, Linux, VxWorks
CoreLite  TI DSP C67 family             Code Composer toolset

The higher clock speed, 3 or more GHz

COTS technology enables cost-effective solutions

Anticipated lower-power versions from Intel and third parties in future

Not all embedded systems are equally power/heat constrained, even now

Double precision 4-way vectorization useful

Future winner in Gflop/Watt? Gflop/$?
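
To make the vectorization point concrete: an SSE register is 128 bits wide and holds four single-precision values, so one instruction can process four elements. The following is a minimal sketch, not VSI/Pro code; the function name `vmul_f32` is hypothetical, the intrinsics are the standard ones from `<xmmintrin.h>`, and a scalar fallback is used when the compiler does not target SSE.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#endif

/* Hypothetical helper (not a VSIPL or VSI/Pro function):
 * out[i] = a[i] * b[i] for i in [0, n). */
void vmul_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE__)
    /* Main loop: one 128-bit register carries four floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    /* Scalar tail (and the whole loop when SSE is unavailable). */
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```

An optimized library would typically also align its buffers and unroll the loop; this sketch only shows the four-lane structure.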

Specially designed architecture for DSP applications.

- very deep pipeline

- very long instruction word (VLIW) architecture

- streaming data

Better GFlops per $ than G4 / Altivec

Full Core profile support for Windows, Linux, and VxWorks.

Optimized FFT performance for SSE registers (performance graph later)

Optimized matrix library easily achievable also

Can equal or beat MKL (Intel commercial library) in significant aspects of
overall performance... more tuning possible

What we achieved in 1 month:

VSI/Pro CoreLite profile is completely ported for TI C67.

We have C6711-optimized complex-to-complex in-place and out-of-place,
forward and inverse FFTs:

- vsip_ccfftop_f()

- vsip_ccfftip_f()

- C6711 150 MHz CPU: 29300 cycles for 1024-element FFT
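
For scale, the cycle count above can be converted to elapsed time and, under the conventional 5*N*log2(N) operation count for a complex FFT, to an approximate MFLOPS rate. That operation count is an assumption for illustration, not a figure from the poster, and the helper names below are hypothetical.

```c
#include <assert.h>

/* Microseconds taken by `cycles` clock cycles at `clock_mhz` MHz.
 * Cycles divided by MHz gives microseconds directly. */
double cycles_to_us(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles / clock_mhz;
}

/* Conventional flop estimate for an N-point complex FFT:
 * 5 * N * log2(N). N must be a power of two. */
double fft_flops(unsigned n)
{
    unsigned log2n = 0;
    assert(n >= 1 && (n & (n - 1)) == 0);
    while ((1u << log2n) < n)
        ++log2n;
    return 5.0 * n * log2n;
}

/* Flops per microsecond is numerically equal to MFLOPS. */
double fft_mflops(unsigned n, double cycles, double clock_mhz)
{
    return fft_flops(n) / cycles_to_us(cycles, clock_mhz);
}
```

With the quoted numbers, `cycles_to_us(29300, 150)` is about 195 microseconds, and `fft_mflops(1024, 29300, 150)` is roughly 262 under that convention.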

C side: Straightforward

C++ side: Strict on template support

VLIW assembly side: No hand-tuned assembly code in the library yet...
next step before product release

C67 Operating Systems

Examples of OS platforms:

• SPARK (Small Portable Adjustable Real-time Kernel)

• OSE

• Diamond

• ThreadX

ABS GmbH Jena

Algorithm was engineered for the architecture to minimize problems arising
from the scarcity of registers and lower cache associativity.

The algorithm is auto-sort decimation-in-frequency (DIF), efficient not
only on power-of-two sizes.

The key functions are written in assembler, supported by highly optimized
C and C++ code, using SSE.
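
To illustrate what "auto-sort" means, here is a minimal radix-2 Stockham-style DIF FFT in portable C. It is a sketch of the general technique only, not the VSI/Pro implementation: it handles power-of-two sizes only (unlike the algorithm described above) and uses no SSE or assembler. The type `cpx` and the function `stockham_dif` are hypothetical names, and the caller is assumed to supply a twiddle table `tw[k] = exp(-2*pi*i*k/n)` for k < n/2.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cpx;   /* one complex sample */

/* Forward FFT of x[0..n-1]; n must be a power of two.
 * work is scratch of length n; tw[k] = exp(-2*pi*i*k/n) for k < n/2.
 * The result is left in x in natural order with no bit-reversal pass:
 * each stage writes its outputs where the next stage expects them. */
void stockham_dif(size_t n, cpx *x, cpx *work, const cpx *tw)
{
    cpx *src = x, *dst = work;
    size_t len, s;                  /* len: sub-transform length, s: stride */
    assert(n >= 1 && (n & (n - 1)) == 0);

    for (len = n, s = 1; len > 1; len /= 2, s *= 2) {
        size_t m = len / 2, p, q;
        for (p = 0; p < m; ++p) {
            cpx w = tw[p * s];      /* exp(-2*pi*i*p/len), since s = n/len */
            for (q = 0; q < s; ++q) {
                cpx a = src[q + s * p];
                cpx b = src[q + s * (p + m)];
                cpx d = { a.re - b.re, a.im - b.im };
                cpx *o = dst + q + s * 2 * p;
                o[0].re = a.re + b.re;                /* a + b */
                o[0].im = a.im + b.im;
                o[s].re = d.re * w.re - d.im * w.im;  /* (a - b) * w */
                o[s].im = d.re * w.im + d.im * w.re;
            }
        }
        { cpx *t = src; src = dst; dst = t; }         /* ping-pong buffers */
    }
    if (src != x) {                 /* odd number of stages: copy back */
        size_t i;
        for (i = 0; i < n; ++i)
            x[i] = src[i];
    }
}
```

The twiddle table would normally be filled once, for example with `cosf` and `sinf`, and reused across calls, which mirrors the create-then-apply pattern of FFT objects in VSIPL.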

FFT performance on P4

RCFFT comparison with MKL...

[Figure: VSI/Pro vs. MKL RCFFT performance on a Pentium 4, 2.0 GHz, 512 KB L2 cache]

Interleaved in-place CCFFT comparison

FFT performance on P4

Split in-place CCFFT

[Figure: Split complex-to-complex in-place FFT performance; series: VSI/Pro, MKL]
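
The interleaved and split comparisons differ in how complex data is laid out in memory. As a hedged illustration: interleaved storage keeps each real part next to its imaginary part (re0, im0, re1, im1, ...), while split storage keeps all real parts in one array and all imaginary parts in another. The helper names below are hypothetical and are not part of VSIPL.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: inter[2*i] is the real part and inter[2*i + 1]
 * the imaginary part of element i. Split layout: re[i] and im[i]. */
void interleaved_to_split(const float *inter, float *re, float *im, size_t n)
{
    size_t i;
    assert(inter != NULL && re != NULL && im != NULL);
    for (i = 0; i < n; ++i) {
        re[i] = inter[2 * i];
        im[i] = inter[2 * i + 1];
    }
}

/* Inverse conversion: rebuild the interleaved array from split parts. */
void split_to_interleaved(const float *re, const float *im, float *inter, size_t n)
{
    size_t i;
    assert(inter != NULL && re != NULL && im != NULL);
    for (i = 0; i < n; ++i) {
        inter[2 * i] = re[i];
        inter[2 * i + 1] = im[i];
    }
}
```

Which layout is faster depends on the kernel and the vector unit, which is why the two cases are benchmarked separately.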

FFT optimization for DSP

Using Radix-2 and Radix-4 algorithms.

Also using cache splitting (the L1 cache is 4 KB, so the splitting is needed
for sizes >256 for in-place FFT).
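
The 256-point threshold is consistent with a simple working-set estimate. The sketch below assumes single-precision complex data (8 bytes per element) plus a twiddle table of N/2 complex values; the twiddle-table size and the function names are assumptions for illustration, not details from the poster.

```c
#include <assert.h>
#include <stddef.h>

enum { BYTES_PER_COMPLEX_F32 = 8 };   /* two 4-byte floats */

/* Bytes touched by an in-place N-point FFT: N data elements plus an
 * assumed N/2-entry twiddle table. */
size_t fft_working_set_bytes(size_t n)
{
    return (n + n / 2) * BYTES_PER_COMPLEX_F32;
}

/* Nonzero when the working set fits in a cache of cache_bytes bytes. */
int fft_fits_in_cache(size_t n, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return fft_working_set_bytes(n) <= cache_bytes;
}
```

Under these assumptions a 256-point transform needs 3072 bytes and fits in a 4 KB cache, while a 512-point transform needs 6144 bytes and does not, so larger sizes have to be split into cache-sized pieces.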

PRELIMINARY FFT performance on C67

[Figure: FFT performance on a TI DSP C6711 150 MHz DSK]

Release the CoreLite profile library for the C67 platform

Explore releasing Core profile library

Explore possibilities of partnering with OS vendors such as OSE Systems

Single-precision optimization is not a big concern outside
embedded computing...

Good free libraries (e.g., FFTW, LAPACK) and MKL exist as alternatives

Academic basic kernels for matrix multiplication (non-ATLAS) are now
mature enough to use with small code size, but these are not open
source/redistributable (e.g., libgoto)

Several universities are working on better free libraries

Code bloat is an issue for certain library architectures when
considering embedded use (e.g., ATLAS code size)

The merger of free libraries and free VSIPL has been tried, but is not
as good as a fully optimized library (e.g., VSIPL ERI, VSIPL
Ref Implementation upgrade)

Demand for commercial VSIPL for P4 remains a question

©  2001-2003  MPI  Software  Technology,  Inc. 


Distinct flavors of P4 (e.g., Athlon) have distinctively different
optimal libraries

- Cache architecture

- Instruction decode differences

- TLB and other memory issues

- Register file differences (e.g., 16 vs 8)

Strong potential that future embedded P4 clones will also have different
optimal choices in their hardware configurations

Why we think it is useful to have commercial VSIPL on P4 and C67

• Shows a true performance portability story between diverse
architectures, not just different G4/Altivec OSes and vendors

• Allows system designers to assume low software porting cost and
explore other aspects of design alternatives

• Processors are getting harder to program

• The precise mix of optimizations required for embedded use is not a
strong emphasis of free libraries per se

Demand for VSIPL for non-G4 platforms is TBD...
appears promising but not well developed

Opportunities to achieve extremely high performance on clearly different
architectures are now evident

Proof of concept may help drive adoption

Technical hurdles involving hand-optimization remain for key inner kernels
on each new platform, but do not require massive coding in assembly
language if handled correctly

C/C++ toolchain is always an issue for new processor + OS combinations